Create DataFrame
val empDF = spark.createDataFrame(Seq(
(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
)).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
empDF.show
Basic Commands
show: The show() method displays DataFrame records in a readable tabular format on stdout.
empDF.show
A few things to observe here:
- By default, show displays only the first 20 records. This is equivalent to SAMPLE/TOP/LIMIT 20 in other SQL environments.
- Strings longer than 20 characters are truncated: a value such as "William Henry Harrison" would display as "William Henry Har…" (none of the names in empDF are that long, so no truncation is visible here). This is equivalent to width/colwidth settings in a typical SQL environment. Both defaults can be overridden, as shown below.
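Both behaviours can be changed through show's optional arguments, numRows and truncate (part of the standard Dataset.show API; the row count below is arbitrary):

// Display only the first 5 rows
empDF.show(5)
// Display 5 rows without truncating long string values
empDF.show(5, false)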
Print first n records of a DataFrame
head(): The head(n) function returns the first n rows of the DataFrame, based on position. It is useful for quickly checking that your DataFrame holds the right kind of data.
empDF.head(5)
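In Scala, head(n) returns an Array[Row], so the rows can also be inspected programmatically. A minimal sketch (field index 1 corresponds to ename in empDF's column order):

// Take the first 3 rows as an Array[Row]
val firstRows = empDF.head(3)
// Print the ename field (index 1) of each row
firstRows.foreach(row => println(row.getString(1)))
// first() returns just the first Row
val firstRow = empDF.first()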
Print the DataFrame schema
printSchema: Prints the schema of the DataFrame to stdout in a tree format.
empDF.printSchema
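For empDF as created above, the printed tree looks roughly like this (nullable flags can vary with how the DataFrame was built; columns coming from a Seq of tuples typically mark primitive types as non-nullable):

root
 |-- empno: integer (nullable = false)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: integer (nullable = false)
 |-- hiredate: string (nullable = true)
 |-- sal: integer (nullable = false)
 |-- comm: integer (nullable = false)
 |-- deptno: integer (nullable = false)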
Get schema as a StructType
schema: Returns the schema of the DataFrame as a StructType.
empDF.schema
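Since schema returns a StructType, its fields can be traversed programmatically, for example to list each column with its type and nullability. A minimal sketch:

// Walk the StructField entries of the schema
empDF.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType} (nullable = ${f.nullable})")
}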
DataFrame column list
columns: Returns all column names as an array.
empDF.columns
empDF.columns.mkString(",")
empDF: org.apache.spark.sql.DataFrame = [empno: int, ename: string ... 6 more fields]
res14: String = empno,ename,job,mgr,hiredate,sal,comm,deptno
empDF.columns.mkString("|")
empDF: org.apache.spark.sql.DataFrame = [empno: int, ename: string ... 6 more fields]
res16: String = empno|ename|job|mgr|hiredate|sal|comm|deptno
Get column names and data types
dtypes: Returns all column names and their data types as an array of (name, type) pairs.
empDF.dtypes
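Because dtypes returns an Array[(String, String)], it works naturally with pattern matching. A small sketch:

// Print each column name alongside its data type
empDF.dtypes.foreach { case (name, dataType) =>
  println(s"$name -> $dataType")
}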
Select only particular columns from DataFrame
empDF.select("empno","ename","sal").show(5)
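select also accepts Column expressions, so you can derive computed columns while selecting; selectExpr does the same with SQL expression strings. A hedged sketch (the raised_sal column name is made up for illustration; the $ syntax is available in spark-shell):

// Column expressions: derive a new column while selecting
empDF.select($"empno", $"ename", ($"sal" * 1.1).alias("raised_sal")).show(5)
// Equivalent using SQL expression strings
empDF.selectExpr("empno", "ename", "sal * 1.1 as raised_sal").show(5)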

ALIAS
- An alias makes column or table names more readable, or simply shorter. You may want to rename columns while displaying them to the user, or give tables alias names when using them in joins.
- Besides readability, aliases make a developer's life easier by allowing shorter table names in join conditions.
- In SQL you may have to give an alias to a DERIVED table as well. Now let's see how to give alias names to columns or tables in Spark SQL, using the alias() function with column names and table names; a self-join example using table aliases follows the code below.
// Sample employee data (empno, ename, mgr) drawn from the records above
val emp = Seq((7369, "SMITH", 7902), (7566, "JONES", 7839), (7839, "KING", 0))
val dfEmp = spark.createDataFrame(emp).toDF("employee_id","employee_Name","mgrid")
dfEmp.alias("employee").select($"employee_id".alias("empId"),$"employee_Name".alias("empName"),$"mgrid").show